Abstract

We present general-purpose methods for recognizing certain
types of structure in HTML documents. The methods are
implemented using WHIRL, a "soft" logic that incorporates a
notion of textual similarity developed in the information
retrieval community. In an experimental evaluation on 82 Web
pages, the structure ranked first by our method is
"meaningful" (i.e., a structure that was used in a hand-coded
"wrapper", or extraction program, for the page) nearly 70% of
the time. This improves on a value of 50% obtained by an
earlier method. With appropriate additional information, the
structure-recognition methods we describe can also be used to
learn a wrapper from examples, or for maintaining a wrapper as
a Web page changes format. In these settings, the top-ranked
structure is meaningful nearly 85% of the time.

Introduction 

Web-based information integration systems allow a user
to query structured information that has been extracted
from the Web (Levy, Rajaraman, & Ordille 1996;
Garcia-Molina et al. 1995; Knoblock et al. 1998;
Genesereth, Keller, & Duschka 1997; Sahuguet
& Chandrasekar 1998; Mecca et al. 1998; Tomasic et
al. 1997). In most such systems, a different wrapper
must be written for each Web site that is accessed.
A wrapper is a special-purpose program that extracts
information from Web pages written in a specific format.
Because data can be presented in many different formats,
and because Web pages frequently change, building and
maintaining wrappers is time-consuming and tedious. To
reduce the cost of building wrappers, some researchers
have proposed special languages for writing wrappers
(Hammer et al. 1997; Cohen 1998b), or semi-automated tools
for wrapper construction (Ashish & Knoblock 1997). Others
have implemented systems that allow wrappers to be trained
from examples (Kushmerick, Weld, & Doorenbos 1997;
Hsu 1998; Muslea, Minton, & Knoblock 1998). Data
exchange standards like XML have also been proposed,
although as yet none are in widespread use.

Here we explore another approach to this problem:
developing general-purpose methods for automatically

Copyright © 1999, American Association for Artificial In- 
telligence. All rights reserved. 
Figure 1: Nonsense text with a meaningful structure.

recognizing structure in HTML documents. Our ultimate
goal is to extract structured information from Web pages
without any page-specific programming or training.

As a motivating example for this approach, consider
Figure 1. To a human reader, this text is perceived as a
list of three items, each containing the italicized name
of a university department, with the university name
underlined. This apparently meaningful structure is
recognized without previous knowledge or training, even
though the text is ungrammatical nonsense and the
university names are imaginary. This suggests that people
employ general-purpose, page-independent strategies for
recognizing structure in documents. Incorporating similar
strategies into a system that automatically (or
semi-automatically) constructs wrappers would clearly be
valuable.

Below we show that effective structure recognition
methods for certain restricted types of list structures
can be encoded compactly and naturally, given appropriate
tools. In particular, we will present methods that can be
concisely implemented in WHIRL (Cohen 1998a), a "soft"
logic that includes both "soft" universal quantification
and a notion of textual similarity.
HTML source for a simple list:

    <html><head>...</head>
    <body><h1>Editorial Board Members</h1>
    <table> <tr>
    <td>G. R. Emlin, Lucent</td>
    <td>Harry Q. Bovik, Cranberry U</td></tr>
    <tr>
    <td>Bat Gangley, UC/Bovine</td>
    <td>Pheobe L. Mind, Lough Tech</td>

Extracted data:

    G. R. Emlin, Lucent
    Harry Q. Bovik, Cranberry U
    ...

HTML source for a simple hotlist:

    <html><head>...</head>
    <body><h1>Publications for Pheobe Mind</h1>
    <ul>
    <li>Optimization of fuzzy neural networks using
    distributed parallel case-based genetic knowledge discovery
    (<a href="buzz.pdf">PDF</a>)</li>
    <li>A linear-time version of GSAT
    (<a href="peqnp.ps">postscript</a>)</li>

Extracted data:

    Optimization of ...             buzz.pdf
    A linear-time version of ...    peqnp.ps

Figure 2: A simple list, a simple hotlist, and the data that would be extracted from each.

The relation extracted from a simple hotlist contains a
set of pairs $(s_1,u_1), \ldots, (s_N,u_N)$; each $s_i$ is
all the text that falls below some node $n_i$ in the HTML
parse tree, and each $u_i$ is a URL that is associated with
some HTML anchor element $a_i$ that appears somewhere
inside $n_i$. Figure 2 shows the HTML source for a simple
list and a simple hotlist, and the data that is extracted
from each.

This restriction is based on our experience with a
working information integration system (Cohen 1998b).
Of 111 different wrapper programs written for this system,
82 (or nearly 75%) were based on simple lists or
simple hotlists, as defined above (see footnote 1). We
will use this corpus of problems in the experiments
described below.

The vector space representation for text 
Our ability to perceive structure in the text of Figure 1
is arguably enhanced by the regular appearance of
substrings that are recognizable as (fictitious) university
names. These strings are recognizable because they
"look like" the names of real universities. Implementing
such heuristics requires a precise notion of similarity
for text, and one such notion is provided by the vector
space model of text.

In the vector space model, a piece of text is represented
as a document vector (Salton 1989). We assume a
vocabulary $T$ of terms; in this paper, terms are word
stems produced by the Porter stemming algorithm
(Porter 1980). A document vector is a vector of real
numbers $\vec{v} \in \mathbb{R}^{|T|}$, each component of
which corresponds to a term $t \in T$. We will denote the
component of $\vec{v}$ which corresponds to $t \in T$ by
$v_t$, and employ the TF-IDF weighting scheme (Salton
1989): for a document vector $\vec{v}$ appearing in a
collection $C$, we let $v_t$ be zero if the term $t$ does
not occur in the text represented by $\vec{v}$, and
otherwise let $v_t = (\log(TF_{\vec{v},t}) + 1) \cdot \log(IDF_t)$.
In this formula, $TF_{\vec{v},t}$ is the number of times that


Used without any page-specific information, the methods
produce a ranked list of proposed "structures" found in
the page. This ranking is generally quite useful: in an
experimental evaluation on 82 Web pages associated with
real extraction problems, the top-ranked structure is
"meaningful" (as defined below) nearly 70% of the time.
This improves on an earlier method (Cohen & Fan 1999),
which proposes meaningful structures about 50% of the time
on the same data.

By providing different types of additional information
about a page, the same methods can also be used for
page-specific wrapper learning as proposed by Kushmerick
et al. (1997), or for updating a wrapper after the format
of a wrapped page has changed. When used for page-specific
learning or wrapper update, the top-ranked structure is
meaningful nearly 85% of the time.

Background 
Benchmark problems 

We begin by clarifying the structure-recognition problem,
with the aim of stating a task precise enough to allow
quantitative evaluation of performance. Deferring for now
the question of what a "structure" is, we propose to rate
the "structures" identified by our methods as either
meaningful or not meaningful. Ideally, a structure in a
Web page would be rated as meaningful iff it contains
structured information that could plausibly be extracted
from the page. Concretely, in our experiments, we will use
pages that were actually wrapped by an information
integration system, and consider a structure as meaningful
iff it corresponds to information actually extracted by an
existing, hand-coded wrapper for that page.

In this paper, we will restrict ourselves to wrappers
in two narrow classes (and therefore, to a narrow class
of potential structures). We call these wrapper classes
simple lists and simple hotlists. In a page containing a
simple list, the information extracted is a one-column
relation containing a set of strings $s_1, \ldots, s_N$,
and each $s_i$ is all the text that falls below some node
$n_i$ in the HTML parse tree for the page. In a simple
hotlist, the extracted information is a two-column
relation.


1 We say "based on" because some lists also included
preprocessing or filtering steps. We note also that the
relative simplicity of these wrappers is due in part to
special properties of this system. Further discussion of
this dataset can be found elsewhere (Cohen & Fan 1999).
term $t$ occurs in the document represented by $\vec{v}$, and
$IDF_t = |C| / |C_t|$, where $C_t$ is the set of documents
in $C$ that contain $t$.

In the vector space model, the similarity of two
document vectors $\vec{v}$ and $\vec{w}$ is given by the
formula

$SIM(\vec{v},\vec{w}) = \sum_{t \in T} \frac{v_t \cdot w_t}{\|\vec{v}\| \cdot \|\vec{w}\|}$

Notice that $SIM(\vec{v},\vec{w})$ is always between zero
and one, and that similarity is large only when the two
vectors share many "important" (highly weighted) terms.
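
The weighting and similarity formulas above can be illustrated with a minimal Python sketch. It is not WHIRL's implementation: the function names are hypothetical, terms are plain whitespace-separated tokens rather than Porter stems, and the three-document collection is made up for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a sparse TF-IDF vector (a dict)."""
    n_docs = len(docs)
    # |C_t|: number of documents in the collection that contain term t
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (math.log(count) + 1.0) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return vectors

def sim(v, w):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(weight * w[term] for term, weight in v.items() if term in w)
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    if norm_v == 0.0 or norm_w == 0.0:
        return 0.0
    return dot / (norm_v * norm_w)

# Illustrative collection (assumption: lowercased, whitespace-tokenized text).
docs = [
    "pheobe l mind lough tech".split(),
    "pheobe mind".split(),
    "harry q bovik cranberry u".split(),
]
vecs = tfidf_vectors(docs)
```

In this toy collection, the first two documents share the terms "pheobe" and "mind" and so receive a positive similarity, while the first and third share no terms and receive a similarity of zero.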

The WHIRL logic 

Overview. WHIRL is a logic in which the fundamental
items that are manipulated are not atomic values,
but entities that correspond to fragments of text. Each
fragment is represented internally as a document vector,
as defined above; this means the similarity between any
two items can be computed. In brief, WHIRL is
non-recursive, function-free Prolog, with the addition of a
built-in similarity predicate; rather than being true or
false, a similarity literal is associated with a
real-valued "score" between 0 and 1; and scores are
combined as if they were independent probabilities.

As an example of a WHIRL query, let us suppose that
the information extracted from the simple list of
Figure 2 is stored as a predicate ed_board(X). Suppose also
that the information extracted from the hotlist of
Figure 2, together with a number of similar bibliography
hotlists, has been stored in a predicate paper(Y,Z,U),
where Y is an author name, Z a paper title, and U a
paper URL. For instance, the following facts may have
been extracted and stored: ed_board("Pheobe L. Mind,
Lough Tech"), and paper("Pheobe Mind", "A linear-time
version of GSAT", "http://.../peqnp.ps"). Using
WHIRL's similarity predicate "~", the following query
might be used to find papers written by editorial board
members:

    ← ed_board(X) ∧ paper(Y,Z,U) ∧ X~Y

The answer to this query would be a list of substitutions
θ, each with an associated score. Substitutions that
bind X and Y to similar documents would be scored
higher. One high-scoring substitution might bind X
to "Pheobe L. Mind, Lough Tech" and Y to "Pheobe
Mind".

Below we will give a formal summary of WHIRL. A
complete description is given elsewhere (Cohen 1998a).
WHIRL semantics. Like a conventional deductive
database (DDB) program, a WHIRL program consists
of two parts: an extensional database (EDB), and an
intensional database (IDB). The IDB is a non-recursive
set of function-free definite clauses. The EDB is a
collection of ground atomic facts, each associated with a
numeric score in the range (0,1]. In addition to the
types of literals normally allowed in a DDB, clauses in
the IDB can also contain similarity literals of the form
X~Y, where X and Y are variables. A WHIRL predicate
definition is called a view. We will assume below that
views are flat; that is, that each clause body in the
view contains only literals associated with predicates
defined in the EDB. Since WHIRL does not support
recursion, views that are not flat can be "flattened"
(unfolded) by repeated resolution.

In a conventional DDB, the answer to a conjunctive
query would be the set of ground substitutions that
make the query true. In WHIRL, the notion of provability
will be replaced with a "soft" notion of score, which
we will now define. Let $\theta$ be a ground substitution
for $B$. If $B = p(X_1, \ldots, X_a)$ corresponds to a
predicate defined in the EDB, then $SCORE(B,\theta) = s$
if $B\theta$ is a fact in the EDB with score $s$, and
$SCORE(B,\theta) = 0$ otherwise. If $B$ is a similarity
literal $X \sim Y$, then $SCORE(B,\theta) = SIM(\vec{x},\vec{y})$,
where $\vec{x} = X\theta$ and $\vec{y} = Y\theta$.
If $B = B_1 \wedge \ldots \wedge B_k$ is a conjunction of
literals, then $SCORE(B,\theta) = \prod_{i=1}^{k} SCORE(B_i,\theta)$.
Finally, consider a WHIRL view, defined as a set of clauses
of the form $A_i \leftarrow Body_i$. For a ground atom $a$
that is an instance of one or more $A_i$'s, we define the
support of $a$, $SUPPORT(a)$, to be the set of all pairs
$(\sigma, Body_i)$ such that $A_i\sigma = a$, $Body_i\sigma$
is ground, and $SCORE(Body_i,\sigma) > 0$. We define the
score of an atom $a$ (for this view) to be

$1 - \prod_{(\sigma, Body_i) \in SUPPORT(a)} (1 - SCORE(Body_i,\sigma))$

This definition follows from the usual semantics of logic
programs, together with the observation that if $e_1$ and
$e_2$ are independent events, then
$Prob(e_1 \vee e_2) = 1 - (1 - Prob(e_1))(1 - Prob(e_2))$.
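
The two combination rules above (a product over the literals of a conjunction, and a noisy-or over the supporting clause bodies of an atom) can be sketched in a few lines of Python. The function names and the example scores are illustrative assumptions, not part of WHIRL.

```python
from math import prod

def score_conjunction(literal_scores):
    """Score of B_1 ∧ ... ∧ B_k: the product of the literal scores."""
    return prod(literal_scores)

def score_atom(support_scores):
    """Score of a ground atom: noisy-or over the scores of its supporting bodies."""
    return 1.0 - prod(1.0 - s for s in support_scores)

# Two hypothetical clause bodies that both derive the same ground atom.
body_a = score_conjunction([1.0, 0.8, 0.5])   # an EDB fact, and two similarity literals
body_b = score_conjunction([0.9, 0.5])
atom = score_atom([body_a, body_b])           # 1 - (1 - 0.4) * (1 - 0.45), about 0.67
```

Because the bodies are treated as independent events, each additional supporting body can only raise the score of the atom.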
The operations most commonly performed in
WHIRL are to define and materialize views. To
materialize a view, WHIRL finds a set of ground atoms
$a$ with non-zero score $s_a$ for that view, and adds them
to the EDB. Since in most cases, only high-scoring
answers will be of interest, the materialization operator
takes two parameters: $r$, an upper bound on the number
of answers that are generated, and $\epsilon$, a lower bound
on the score of answers that are generated.

Although the procedure used for combining scores in
WHIRL is naive, inference in WHIRL can be implemented
quite efficiently. This is particularly true if $\epsilon$
is large or $r$ is small, and if certain approximations are
allowed (Cohen 1998a).
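
The effect of the two materialization parameters can be sketched as follows. This only illustrates what r and ε mean: it assumes every candidate atom has already been scored, whereas the efficient inference described in (Cohen 1998a) avoids scoring all candidates. The function name and the example scores are hypothetical.

```python
import heapq

def materialize(scored_atoms, r, eps):
    """Return at most r (atom, score) pairs with score >= eps, best first."""
    candidates = [(atom, s) for atom, s in scored_atoms.items() if s >= eps]
    return heapq.nlargest(r, candidates, key=lambda pair: pair[1])

# Hypothetical scores for four ground atoms of some view.
scores = {"a": 0.9, "b": 0.2, "c": 0.5, "d": 0.7}
top = materialize(scores, r=2, eps=0.3)
```

Here atom "b" is dropped by the score bound, and "c" is dropped by the bound on the number of answers.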

The "many" construct. The structure-recognition
methods we will present require a recent extension to
the WHIRL logic: a "soft" version of universal
quantification. This operator is written many(Template, Test),
where the Test is an ordinary conjunction of literals
and the Template is a single literal of the form
$p(Y_1, \ldots, Y_k)$, where $p$ is an EDB predicate and the
$Y_i$'s are all distinct variables that may appear only in
Test. The score of a "many" clause is the weighted average
score of the Test conjunction on items that match the
Template. More formally, for a substitution $\theta$ and a
conjunction Test,

$SCORE(many(p(Y_1,\ldots,Y_k), Test), \theta) = \sum_{(s,a_1,\ldots,a_k) \in P} \frac{s}{S} \cdot SCORE(Test, \theta \circ \{Y_i = a_i\}_i)$

where $P$ is the set of all tuples $(s, a_1, \ldots, a_k)$
such that $p(a_1, \ldots, a_k)$ is a fact in the EDB with
score $s$; $S$ is the sum of all such scores $s$; and
$\{Y_i = a_i\}_i$ denotes the substitution
$\{Y_1 = a_1, \ldots, Y_k = a_k\}$.

As an example, the following WHIRL query is a request
for editorial board members that have written
"many" papers on neural networks.

    q(X) ← ed_board(X) ∧
        many(papers(Y,Z,W),
            (X~Y ∧ Z~"neural networks") ).

Recognizing structure with WHIRL 
Encoding HTML pages and wrappers 

We will now give a detailed description of how
structure-recognition methods can be encoded in
WHIRL. We begin with a description of the encoding
used for an HTML page.

To encode an HTML page in WHIRL, the page is
first parsed. The HTML parse tree is then represented
with the following EDB predicates.
• elt(Id,Tag,Text,Position) is true if Id is the identifier
for a parse tree node, n, Tag is the HTML tag associated
with n, Text is all of the text appearing in the
subtree rooted at n, and Position is the sequence of
tags encountered in traversing the path from the root
to n. The value of Position is encoded as a document
containing a single term t_pos, which represents
the sequence, e.g., t_pos = "html_body_ul_li".
• attr(Id,AName,AValue) is true if Id is the identifier
for node n, AName is the name of an HTML attribute
associated with n, and AValue is the value of that
attribute.
• path(FromId,ToId,Tags) is true if Tags is the sequence
of HTML tags encountered on the path between
nodes FromId and ToId. This path includes
both endpoints, and is defined if FromId=ToId.
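
As a rough illustration of this encoding, the following Python sketch builds elt, attr, and path facts with the standard-library `html.parser` module. It is a simplification under stated assumptions: the class and function names are hypothetical, the input must be well-formed HTML in which every tag is explicitly closed, and path facts are generated only from a node to its descendants (including itself).

```python
from html.parser import HTMLParser

class FactBuilder(HTMLParser):
    """Collect parse-tree information from HTML in which every tag is closed."""

    def __init__(self):
        super().__init__()
        self.stack = []    # ids of currently open nodes, root first
        self.tags = {}     # node id -> tag name
        self.parent = {}   # node id -> parent id (None for the root)
        self.text = {}     # node id -> text chunks found in its subtree
        self.attrs = []    # attr facts: (id, attribute name, attribute value)

    def handle_starttag(self, tag, attrs):
        node = len(self.tags)
        self.tags[node] = tag
        self.parent[node] = self.stack[-1] if self.stack else None
        self.text[node] = []
        self.attrs.extend((node, name, value) for name, value in attrs)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if self.stack and self.tags[self.stack[-1]] == tag:
            self.stack.pop()

    def handle_data(self, data):
        for node in self.stack:   # text belongs to every open ancestor
            self.text[node].append(data)

def ancestors(builder, node):
    """Node ids from the root down to node, inclusive."""
    chain = []
    while node is not None:
        chain.append(node)
        node = builder.parent[node]
    return chain[::-1]

def encode(html):
    """Return (elt, attr, path) fact lists for an HTML string."""
    b = FactBuilder()
    b.feed(html)
    b.close()
    elt, path = [], []
    for node in b.tags:
        chain = ancestors(b, node)
        position = "_".join(b.tags[n] for n in chain)
        text = " ".join("".join(b.text[node]).split())
        elt.append((node, b.tags[node], text, position))
        # one path fact from each ancestor (and the node itself) down to node
        for i, start in enumerate(chain):
            path.append((start, node, "_".join(b.tags[n] for n in chain[i:])))
    return elt, b.attrs, path

html = ('<html><body><ul>'
        '<li>A linear-time version of GSAT (<a href="peqnp.ps">postscript</a>)</li>'
        '</ul></body></html>')
elt, attr, path = encode(html)
```

For this one-item hotlist, the li node receives the position "html_body_ul_li", and the path from the li node to its anchor is "li_a".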
As an example, wrappers for the pages in Figure 2 can
be written using these predicates as follows.

    page1(NameAffil) ←
        elt(_, _, NameAffil, "html_body_table_tr_td").

    page2(Title,Url) ←
        elt(ContextElt, _, Title, "html_body_ul_li")
        ∧ path(ContextElt, AnchorElt, "li_a")
        ∧ attr(AnchorElt, "href", Url).
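
To make the reading of the page2 clause concrete, here is a small Python sketch that evaluates its body as a nested join over hand-written facts. It is illustrative only: the fact values and ids are made up, the function name is hypothetical, and the Position and Tags arguments are compared by exact string equality rather than by WHIRL's similarity scores.

```python
# Hand-written facts for one hotlist item; ids and values are illustrative.
elt = [
    (3, "li", "A linear-time version of GSAT (postscript)", "html_body_ul_li"),
    (4, "a", "postscript", "html_body_ul_li_a"),
]
attr = [(4, "href", "peqnp.ps")]
path = [(3, 3, "li"), (3, 4, "li_a"), (4, 4, "a")]

def page2(elt, attr, path, path1="html_body_ul_li", path2="li_a"):
    """Yield (Title, Url) pairs, mirroring the body of the page2 clause."""
    for context_id, _tag, title, position in elt:
        if position != path1:
            continue
        for from_id, anchor_id, tags in path:
            if from_id != context_id or tags != path2:
                continue
            for node_id, name, url in attr:
                if node_id == anchor_id and name == "href":
                    yield title, url

rows = list(page2(elt, attr, path))
```

Passing different values for the two string parameters yields a different wrapper of the same shape.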

Next, we need to introduce an appropriate encoding
of "structures" (and in so doing, make this notion
precise.) Most simple lists and hotlists in our benchmark
collection can be wrapped with some variant of
either the page1 or page2 view, in which the constant
strings (e.g., "html_body_ul_li" and "li_a") are replaced
with different values. Many of the remaining pages can
be wrapped by views consisting of a disjunction of such
clauses.



We thus introduce a new construct to formally represent
the informal idea of a "structure" in a structured
document: a wrapper piece. In the most general setting,
a wrapper piece consists of a clause template (e.g.,
a generic version of page2 above), and a set of template
parameters (e.g., the pair of constants "html_body_ul_li"
and "li_a"). In the experiments below, we consider
only two clause templates (the ones suggested by the
examples above) and also assume that the recognizer
knows, for each page, if it should look for list structures
or hotlist structures. In this case, the clause template
need not be explicitly represented; a wrapper piece for
a page2 variant can be represented simply as a pair
of constants (e.g., "html_body_ul_li" and "li_a"), and a
wrapper piece for a page1 variant can be represented as
a single constant (e.g., "html_body_table_tr_td").

For brevity, we will confine the discussion below to
methods that recognize simple hotlist structures
analogous to page2, and will assume that structures are
encoded by a pair of constants Path1 and Path2. However,
most of the methods we will present have direct analogs
that recognize simple lists.

Enumerating and ranking wrappers 

We will now describe three structure-recognition methods
based on these encodings. We begin with some basic
building blocks. Assuming that some page of interest
has been encoded in WHIRL's EDB, materializing
the WHIRL view possible_piece, shown in Figure 3, will
generate all wrapper pieces that would extract at least
one item from the page. The extracted_by view determines
which items are extracted by each wrapper piece,
and hence acts as an interpreter for wrapper pieces.

Using these views in conjunction with WHIRL's soft
universal quantification, one can compactly state a
number of plausible recognition heuristics. One heuristic
is to prefer wrapper pieces that extract many items;
this trivial but useful heuristic is encoded in the
fruitful_piece view. Recall that materializing a WHIRL view
results in a set of new atoms, each with an associated
score. The fruitful_piece view can thus be used to
generate a ranked list of proposed "structures" by simply
presenting all fruitful_piece facts to the user in
decreasing order by score.

Another structure-recognition method is suggested
by the observation that in most hotlists, the text
associated with the anchor is a good description of the
associated object. This suggests the anchorlike_piece
view, which adds to the fruitful_piece view an additional
"soft" requirement that the text Text1 extracted by the
wrapper piece be similar to the text Text2 associated
with the anchor element.

A final structure-recognition method is shown in the
figure as the R_like_piece view. This view is a copy of
fruitful_piece in which the requirement that many items
are extracted is replaced by a requirement that many
"R-like" items are extracted, where an item is "R-like"
if it is similar to some second item X that is stored in
the EDB relation R. The "soft" semantics of the many
[Figure 3: WHIRL views for recognizing hotlist structures: possible_piece, extracted_by, fruitful_piece, anchorlike_piece, and R_like_piece.]

Figure 4: Performance of ranking heuristics that use little
or no page-specific information.

construct imply that more credit is given to extracting
items that match an item in R closely, and less credit
is given for weaker matches. As an example, suppose
that R contains a list of all accredited universities in the
US. In this case, the R_like_piece would prefer wrapper
pieces that extract many items that are similar to some
known university name; this might be useful in
processing pages like the one shown in Figure 1.

Experiments 

Ranking programs with (almost) no 
page-specific information 

We will now evaluate the three structure-recognition
methods shown in Figure 3. We took the set of 82
hand-coded list and hotlist wrappers described above,
and paired each hand-coded wrapper with a single
Web page that was correctly wrapped. We then analyzed the
hand-coded wrapper programs, and determined which
wrapper pieces they contained. The result of this
preprocessing was a list of 82 Web pages, each of which is
associated with a set of "meaningful" wrapper pieces.
To evaluate a method, we materialize the appropriate




Figure 5: Performance of ranking heuristics that use page- 
specific training examples. 




Figure 6: Performance of ranking heuristics that use text
extracted from a previous version of the page. (Curves:
fruitful; anchorlike; oldpagelike with p=20, c=3;
oldpagelike with p=50, c=2; oldpagelike with p=80, c=1.)

view (see footnote 2), thus generating a ranked list of
proposed structures. A good method is one that ranks
meaningful structures ahead of non-meaningful structures.

To obtain useful aggregate measures of performance,
it is useful to consider how a structure-recognition
method might be used. One possibility is interactive
use: given a page to wrap, the method proposes wrapper
pieces to a human user, who then examines them in
order and manually selects pieces to include in a wrapper.
For interactive use, we measure performance by the
coverage at rank K, and by the average number of skips:
the number of non-meaningful pieces that are ranked ahead
of some meaningful piece; or, equivalently, the average
number of wrapper pieces that would be unnecessarily
examined (skipped over) by the user.

Another possible use for the system is batch use:
given a page, the method proposes a single structure,
which is then used by a calling program without any
filtering. For batch use, a natural measure of performance
is the percentage of the time that the top-ranked
structure is meaningful. Below, we call this measure
accuracy at rank 1, and define error rate at rank 1
analogously.
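
The evaluation measures can be sketched in Python as follows. The function names and the two-page example are hypothetical. The coverage function encodes one reading of "coverage at rank K" (every meaningful piece for the page appears among the top K), which is an assumption based on the description of coverage at K = 1 in footnote 3.

```python
def accuracy_at_rank_1(pages):
    """Fraction of pages whose top-ranked piece is meaningful."""
    hits = sum(1 for ranking, meaningful in pages
               if ranking and ranking[0] in meaningful)
    return hits / len(pages)

def skips(ranking, meaningful):
    """Non-meaningful pieces ranked ahead of at least one meaningful piece."""
    positions = [i for i, piece in enumerate(ranking) if piece in meaningful]
    if not positions:
        return 0
    last = positions[-1]
    return sum(1 for piece in ranking[:last] if piece not in meaningful)

def coverage_at_rank(pages, k):
    """Fraction of pages whose meaningful pieces all appear in the top k."""
    covered = sum(1 for ranking, meaningful in pages
                  if meaningful <= set(ranking[:k]))
    return covered / len(pages)

# Each page is (ranked list of proposed pieces, set of meaningful pieces).
pages = [
    (["a", "x", "b"], {"a", "b"}),
    (["y", "c"], {"c"}),
]
average_skips = sum(skips(r, m) for r, m in pages) / len(pages)
```

In this example the first page has a meaningful top-ranked piece but needs three pieces to be fully covered, which shows why accuracy at rank 1 and coverage at K = 1 can differ.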

Figure 4 shows the coverage curves obtained from
methods that require no page-specific information. For
comparison, we also show the performance of an earlier
structure-recognition method (Cohen & Fan 1999). (To
summarize this method briefly, structure recognition is
reduced to the problem of classifying nodes in an HTML
parse tree as to whether or not they are contained
in some meaningful structure. The node-classification
problem can then be solved by off-the-shelf inductive
learning methods such as CART (Breiman et al. 1984)
or RIPPER (Cohen 1995).) This method produces a
single wrapper program (which may correspond to multiple
wrapper pieces), rather than a ranked list of wrapper
pieces. On the data used here, the wrapper proposed
coincides with the true wrapper, or some close
approximation of it, on exactly half the cases.

The anchorlike method (see footnote 4) performs quite
well, obtaining accuracy at rank 1 of nearly 70%, and an
average of 0.9 skips. (These numbers are summarized in
Table 1.) Even the strawman fruitful method works
surprisingly well in interactive use, obtaining an average
number of skips of only 3.3; however, for batch use, its
accuracy
2 In the experiments, we apply to each hotlist page
only structure-recognition views that recognize hotlists, and
apply to each list page only views that recognize simple
lists.

3 Note that accuracy at rank 1 is not identical to coverage
at K = 1; the former records the number of times the
top-ranked wrapper piece is part of the target wrapper, and the
latter records the number of times the top-ranked wrapper piece
is the only piece in the target wrapper.

4 Recall that the anchorlike method can only be applied to
hotlists. In the curve labeled anchorlike, we used the fruitful
method for simple list wrappers, and the anchorlike method
for simple hotlist wrappers.



at rank 1 is less than 20%.

The third curve shown in Figure 4, labeled domainlike,
is an instance of the R_like_piece method in which R
contains items of the same type as those to be extracted.
(For example, if the page to be wrapped contains a list of
universities, then R would be a second list of
universities.) We consider this structure-recognition
method in this section because, although it does require
some information beyond the page itself, this information
is not specific to the page (see footnote 5). For a small
fraction of the problems, however, the secondary relation
R is either unavailable or misleading.

The final curve in Figure 4, labeled domainlike with
backoff, is a simple combination of the domainlike and
anchorlike strategies. In this method, one first
materializes the view R_extracted_by. If it is non-empty,
then the R_like_piece view is materialized, and otherwise,
anchorlike_piece is materialized. This method does as well
in a batch setting as domainlike. In an interactive
setting, it achieves a final coverage of nearly 100% with a
skip rate somewhat lower than anchorlike.

Ranking structures with training data 
Several previous researchers have considered the problem
of learning wrappers from examples (Kushmerick,
Weld, & Doorenbos 1997; Hsu 1998; Muslea, Minton,
& Knoblock 1998). In these systems, the user provides
examples of the items that should be extracted from
a sample Web page, and the system induces a general
procedure for extracting data from that page. If
page-specific training examples are available, they can be
used by storing them in a relation R, and then applying
the R_like method. This use of structure-recognition
methods is quite similar to previous wrapper-learning
systems; one major difference, however, is that no
negative examples need be provided, either explicitly or
implicitly.

To evaluate this wrapper-learning technique, we ran
the target wrappers on each page in order to build a
list of page-specific training examples. We then fixed a
number of training examples m, and for each Web page,
stored m randomly chosen page-specific examples in the
relation R, and applied the R_like structure-recognition
method. We call this the examplelike method. This
process was repeated 10 times for each value of m, and
the results were averaged.

The results from this experiment are shown in
Figure 5. Even two or three labeled examples perform

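5 It seems reasonable to assume that the user (or calling
program) has general knowledge about the type of the items
that will be extracted. In the experiments, the items in
R were always obtained from a second Web page containing
items of the same type; it seems reasonable to assume that
this sort of information will be available.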
                        Average    Accuracy    Coverage
                        # Skips    at Rank 1   at K = ∞

fruitful                  3.3       18.…        100.0
anchorlike                0.9       69.5        100.0

domainlike                 …        84.0          …
  with backoff            0.6       84.0         98.8

examplelike
  m = 2                   0.3       77.8         99.0
  m = 3                   0.3       79.3         99.3
  m = 10                  0.3       84.2        100.0

oldpagelike
  p = 20, c = 3           0.3       85.0         96.1
  p = 50, c = 2           0.3       82.9         99.5
  p = 80, c = 1           0.3       85.4        100.0

Table 1: Summary of results



Items before corruption (Table 2):

    Oklahoma
    Yukon
    Vermont
    British Columbia
    Oklahoma
    Wisconsin
    New Jersey
    Alaska
    New Brunswick
    New Mexico


somewhat better than the anchorlike and fruitful methods,
and achieve complete (or nearly complete) coverage.
However, the average accuracy at rank 1 is not as high as
for the domainlike method, unless many examples are used.

These results show an advantage to presenting the
user with a ranked list of wrapper pieces, as in general,
coverage is improved much more by increasing K than
by increasing m. For example, if the user labels two
examples, then 58.6% of the pages are wrapped correctly
using the top-ranked wrapper piece alone. Providing
eight more examples increases coverage of the
top-ranked piece to only 63.3%; however, if the user
labels no additional examples, but instead considers the
top two wrapper pieces, coverage jumps to 89.4%.

Maintaining a wrapper 

Because Web pages frequently change, maintaining
existing wrappers is a time-consuming process. In this
section, we consider the problem of updating an existing
wrapper for a Web page that has changed. Here a
new source of information is potentially available: one
could retain, for each wrapper, the data that was
extracted from the previous version of the page. If the
format of the page has been changed but not its content,
then the previously-extracted data can be used
as page-specific training examples for the new page format,
and the examplelike method of the previous section
can be used to derive a new wrapper. If the format and
content both change, then the data extracted from the
old version of the page could still be used; however, it
would be only an approximation to the examples that
a user would provide. Using such "approximate examples"
will presumably make structure-recognition more
difficult; on the other hand, there will typically be many
more examples than a user would provide.

Motivated by these observations, we evaluated the
R_like structure-recognition method when R contains a
large number of entries, each of which is a corrupted
version of a data item that should be extracted from
the page. Specifically, we began with a list of all data
items that are extracted by the target wrapper, and


Items after corruption (Table 2):

    dietitians
    Yukon codpiece
    Vermont
    British Columbia Talmudizations
    Oklahoma
    Wisconsin
    New Jersey incorrigible blubber
    Alaska
    (empty)
    New Mexico cryptogram

Table 2: Ten US States and Canadian Provinces, before
and after corruption with c = 1.

then corrupted this list as follows. First, we discarded
all but a randomly-chosen percentage p of the items.
We next performed c · n random edit operations, where
n is the number of retained examples. Each edit
operation randomly selects one of the n items, and then
either deletes a randomly chosen word from the item,
or else adds a word chosen uniformly at random from
a word list.

Figure 6 shows the results of performing this experiment
(again averaged over 10 runs) with values of p
ranging from 80% to 20%, and values of c ranging from
1 to 3. We call this structure-recognition method the
oldpagelike method. With moderately corrupted example
sets, the method performs very well: even with a
corruption level of p = 50% and c = 2, it performs better
on average than the anchorlike method.

It must be noted, however, that the corrupted examples
were generated artificially; it remains to be seen whether
more typical modifications are harder or easier to recover
from.
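
The corruption procedure used in this experiment can be sketched in Python as follows. Several details are assumptions made for the sketch, not taken from the experiments: the function name, the rounding used to compute n, the equal probability of deleting versus adding a word, appending the added word at the end of the item, and the tiny vocabulary.

```python
import random

def corrupt(items, p, c, vocabulary, rng):
    """Keep a random p percent of items, then apply c * n random word edits."""
    n = round(len(items) * p / 100)
    kept = rng.sample(items, n)
    for _ in range(c * n):
        i = rng.randrange(n)
        words = kept[i].split()
        if words and rng.random() < 0.5:
            del words[rng.randrange(len(words))]    # delete a randomly chosen word
        else:
            words.append(rng.choice(vocabulary))    # add a randomly chosen word
        kept[i] = " ".join(words)
    return kept

states = ["Oklahoma", "Yukon", "Vermont", "British Columbia", "Oklahoma",
          "Wisconsin", "New Jersey", "Alaska", "New Brunswick", "New Mexico"]
vocabulary = ["codpiece", "cryptogram", "dietitians"]   # hypothetical word list
corrupted = corrupt(states, p=50, c=1, vocabulary=vocabulary, rng=random.Random(0))
```

Passing an explicitly seeded `random.Random` instance keeps runs repeatable, which is convenient when results are averaged over several runs.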

Conclusions 
In this paper, we considered the problem of recognizing
"structure" in HTML pages. As formulated here,
structure recognition is closely related to the task of
automatically constructing wrappers: in our experiments,
a "structure" is equated with a component of a wrapper,
and a recognized structure is considered "meaningful"
if it is part of an existing wrapper for that page.
We used WHIRL, a "soft" logic that incorporates a
notion of textual similarity developed in the information
retrieval community, to implement several heuristic
methods for recognizing structures from a narrow but
useful class. Implementing these methods also required
an extension to WHIRL: a "soft" version of bounded
universal quantification.
Experimentally, the anchorlike method, for which the
top-ranked structure is meaningful nearly 70% of the time,
improves on simpler ranking schemes for structures, also
improving on an earlier result of ours which used more
conventional methods for recognizing structure. This
method is completely general and requires no
page-specific information. A second structure-recognition
method, the R_like method, was also described, which can
make use of information of many different kinds:
examples of correctly-extracted text; an out-of-date
version of the wrapper, together with a cached version of
the last Web page that this out-of-date version correctly
wrapped; or a list of objects of the same type as those
that will be extracted from the Web page. In each of
these cases, performance can be improved beyond that
obtained by the anchorlike method.
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□ 1. Document ID: US 20030199685 Al 

L13: Entry 1 of 70 File: PGPB 

PGPUB-DOCUMENT-NUMBER: 20030199685 
PGPUB-FILING-TYPE : new 

DOCUMENT-IDENTIFIER: US 20030199685 Al 

TITLE: Cell-based detection and differentiation of disease states 
PUBLICATION-DATE: October 23, 2003 



INVENTOR-INFORMATION: 
NAME 

Pressman, Norman J. 
Hirsch, Kenneth S. 
Hirsch, Adrian 



CITY STATE COUNTRY 

Glencoe IL US 

Redwood City CA US 

Redwood City CA US 



RULE-47 



US-CL-CURRENT: 536/24 .3; 435/69. 1, 435/91.1 
□ 2. Document ID: US 20030191608 A1 

L13: Entry 2 of 70 File: PGPB Oct 9, 2003 

PGPUB-DOCUMENT-NUMBER: 20030191608 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030191608 A1 

TITLE: Data processing and observation system 

PUBLICATION-DATE: October 9, 2003 

INVENTOR-INFORMATION: 
NAME 

Anderson, Mark Stephen 
Engelhardt, Dean Crawford 
Marriott, Damian Andrew 
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US-CL-CURRENT: 702/189 



CITY 

Edinburgh 
Edinburgh 
Edinburgh 
Edinburgh 



STATE 



COUNTRY 

AU 

AU 

AU 

AU 



RULE-47 






□ 3. Document ID: US 20030190602 A1 



1 of 22 



11/4/03 4:39 PM 



Record List Display 



http://westbrs:8002ftin/gate^^^ 
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File: PGPB 



Oct 9, 2003 



PGPUB-DOCUMENT-NUMBER: 20030190602 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030190602 A1 

TITLE: Cell-based detection and differentiation of disease states 
PUBLICATION-DATE: October 9, 2003 

INVENTOR-INFORMATION: 

NAME                  CITY          STATE  COUNTRY  RULE-47 

Pressman, Norman J.   Glencoe       IL     US 

Hirsch, Kenneth S.    Redwood City  CA     US 

US-CL-CURRENT: 435/5; 435/287.2, 435/6, 435/7.23, 435/7.92 
□ 4. Document ID: US 20030187642 A1 

L13: Entry 4 of 70 File: PGPB Oct 2, 2003 

PGPUB-DOCUMENT-NUMBER: 20030187642 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030187642 A1 

TITLE: System and method for the automatic discovery of salient segments in speech 
transcripts 

PUBLICATION-DATE: October 2, 2003 

INVENTOR-INFORMATION: 

NAME                       CITY       STATE  COUNTRY  RULE-47 

Ponceleon, Dulce Beatriz   Palo Alto  CA     US 

Srinivasan, Savitha        San Jose   CA     US 

US-CL-CURRENT: 704/252 






□ 5. Document ID: US 20030176931 A1 

L13: Entry 5 of 70 File: PGPB Sep 18, 2003 

PGPUB-DOCUMENT-NUMBER: 20030176931 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030176931 A1 

TITLE: Method for constructing segmentation-based predictive models 
PUBLICATION-DATE: September 18, 2003 
INVENTOR-INFORMATION: 
PGPUB-DOCUMENT-NUMBER: 20030158795 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030158795 A1 

TITLE: Quality management and intelligent manufacturing with labels and smart tags 
in event-based product manufacturing 

PUBLICATION-DATE: August 21, 2003 

INVENTOR-INFORMATION: 

RULE-47 
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Markham, Charles Earl 
PGPUB-DOCUMENT-NUMBER: 20030155415 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030155415 A1 

PUBLICATION-DATE: August 21, 2003 
□ 8. Document ID: US 20030154144 A1 

File: PGPB Aug 14, 2003 

PGPUB-DOCUMENT-NUMBER: 20030154144 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030154144 A1 

TITLE: Integrating event-based production information with financial and purchasing 
systems in product manufacturing 

PUBLICATION-DATE: August 14, 2003 

INVENTOR-INFORMATION: 

RULE-47 



NAME 


CITY 


STATE 


COUNTRY 


Pokorny, Michael Roy 


Neenah 


WI 


US 


Barber, Douglas Gordon Barron 


Appleton 
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US 


Bush, Perry A. 
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WI 


US 


Hise, John Harland 


Neenah 
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US 


Shun Hoo, Winnie Shi Mei 
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Markham, Charles Earl 
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Matheus, Jon Ray 
□ 10. Document ID: US 20030150908 A1 

L13: Entry 10 of 70 File: PGPB Aug 14, 2003 

PGPUB-DOCUMENT-NUMBER: 20030150908 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030150908 A1 

TITLE: User interface for reporting event-based production information in product 
manufacturing 

PUBLICATION-DATE: August 14, 2003 
INVENTOR-INFORMATION: 
□ 11. Document ID: US 20030144746 A1 

L13: Entry 11 of 70 File: PGPB Jul 31, 2003 

PGPUB-DOCUMENT-NUMBER: 20030144746 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030144746 A1 

TITLE: Control for an industrial process using one or more multidimensional 
variables 

PUBLICATION-DATE: July 31, 2003 

INVENTOR-INFORMATION: 

NAME                   CITY         STATE  COUNTRY  RULE-47 

Hsiung, Chang-Meng     Irvine       CA     US 

Munoz, Bethsabeth      Pasadena     CA     US 

Roy, Ajoy              Pasadena     CA     US 

Steinthal, Michael     Los Angeles  CA     US 

Sunshine, Steven       Pasadena     CA     US 

Vicic, Michael Allen   Pasadena     CA     US 

Zhang, Shou-Hua        Arcadia      CA     US 
□ 12. Document ID: US 20030112921 A1 

L13: Entry 12 of 70 File: PGPB Jun 19, 2003 

PGPUB-DOCUMENT-NUMBER: 20030112921 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030112921 A1 

TITLE: Methods and devices for analysis of x-ray images 
PUBLICATION-DATE: June 19, 2003 
□ 13. Document ID: US 20030109951 A1 

L13: Entry 13 of 70 File: PGPB Jun 12, 2003 

PGPUB-DOCUMENT-NUMBER: 20030109951 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030109951 A1 

TITLE: Monitoring system for an industrial process using one or more 
multidimensional variables 

PUBLICATION-DATE: June 12, 2003 

INVENTOR-INFORMATION: 
NAME 

Hsiung, Chang-Meng B. 
Munoz, Bethsabeth 
Roy, Ajoy Kumar 
Steinthal, Michael Gregory 
Sunshine, Steven A. 
Vicic, Michael Allen 
Zhang, Shou-Hua 



CITY 

Irvine 

Pasadena 

Pasadena 

Los Angeles 
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CA US 

CA US 

CA US 

CA US 

CA US 

CA US 

CA US 
□ 14. Document ID: US 20030104499 A1 

L13: Entry 14 of 70 File: PGPB Jun 5, 2003 

PGPUB-DOCUMENT-NUMBER: 20030104499 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030104499 A1 

TITLE: Cell-based detection and differentiation of lung cancer 
PUBLICATION-DATE: June 5, 2003 



INVENTOR-INFORMATION: 
NAME 

Pressman, Norman J. 
Hirsch, Kenneth S. 
Hirsch, Adrian 



CITY 
Glencoe 
Redwood City 
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STATE COUNTRY 

IL US 

CA US 

CA US 
□ 15. Document ID: US 20030083756 A1 

L13: Entry 15 of 70 File: PGPB 

PGPUB-DOCUMENT-NUMBER: 20030083756 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030083756 A1 

TITLE: Temporary expanding integrated monitoring network 
PUBLICATION-DATE: May 1, 2003 
INVENTOR-INFORMATION: 



NAME 
□ 16. Document ID: US 20030069877 A1 

L13: Entry 16 of 70 File: PGPB Apr 10, 2003 

PGPUB-DOCUMENT-NUMBER: 20030069877 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030069877 A1 

TITLE: System for automatically generating queries 

PUBLICATION-DATE: April 10, 2003 

INVENTOR-INFORMATION: 

NAME                       CITY        STATE  COUNTRY  RULE-47 

Grefenstette, Gregory T.   Gieres             FR 

Shanahan, James G.         Pittsburgh  PA     US 

US-CL-CURRENT: 707/2 
□ 17. Document ID: US 20030065409 A1 

L13: Entry 17 of 70 File: PGPB Apr 3, 2003 

PGPUB-DOCUMENT-NUMBER: 20030065409 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030065409 A1 

TITLE: Adaptively detecting an event of interest 

PUBLICATION-DATE: April 3, 2003 

INVENTOR-INFORMATION: 

NAME                   CITY         STATE  COUNTRY  RULE-47 

Raeth, Peter G.        Beavercreek  OH     US 

Bostick, Randall L.    Springboro   OH     US 

Bertke, Donald Allen   Beavercreek  OH     US 

US-CL-CURRENT: 700/31; 700/28, 700/30, 700/44 
□ 18. Document ID: US 20030061201 A1 

L13: Entry 18 of 70 File: PGPB Mar 27, 2003 

PGPUB-DOCUMENT-NUMBER: 20030061201 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030061201 A1 

TITLE: System for propagating enrichment between documents 
PUBLICATION-DATE: March 27, 2003 

INVENTOR-INFORMATION: 

NAME                       CITY        STATE  COUNTRY  RULE-47 

Grefenstette, Gregory T.   Gieres             FR 

Shanahan, James G.         Pittsburgh  PA     US 

US-CL-CURRENT: 707/3 
□ 19. Document ID: US 20030061200 A1 

L13: Entry 19 of 70 File: PGPB Mar 27, 2003 

PGPUB-DOCUMENT-NUMBER: 20030061200 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030061200 A1 

TITLE: System with user directed enrichment and import/export control 
PUBLICATION-DATE: March 27, 2003 
INVENTOR-INFORMATION: 

NAME               CITY                  STATE  COUNTRY  RULE-47 

Hubert, Laurence   St Bernard du Touvet         FR 

Guerin, Nicolas    Grenoble                     FR 
□ 20. Document ID: US 20030051026 A1 

L13: Entry 20 of 70 File: PGPB Mar 13, 2003 

PGPUB-DOCUMENT-NUMBER: 20030051026 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030051026 A1 

TITLE: Network surveillance and security system 

PUBLICATION-DATE: March 13, 2003 

INVENTOR-INFORMATION: 

NAME               CITY           STATE  COUNTRY  RULE-47 

Carter, Ernst B.   San Francisco  CA     US 

Zolotov, Vasily    San Francisco  CA     US 

US-CL-CURRENT: 709/224; 706/909, 713/201 






□ 21. Document ID: US 20030046421 A1 

L13: Entry 21 of 70 File: PGPB Mar 6, 2003 

PGPUB-DOCUMENT-NUMBER: 20030046421 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030046421 A1 

TITLE: Controls and displays for acquiring preferences, inspecting behavior, and 
guiding the learning and decision policies of an adaptive communications 
prioritization and routing system 

PUBLICATION-DATE: March 6, 2003 

INVENTOR-INFORMATION: 

NAME                    CITY      STATE  COUNTRY  RULE-47 

Horvitz, Eric J.        Kirkland  WA     US 

Baribault, Gregory P.   Lynnwood  WA     US 

US-CL-CURRENT: 709/238; 709/206 
□ 22. Document ID: US 20030033347 A1 

L13: Entry 22 of 70 File: PGPB Feb 13, 2003 

PGPUB-DOCUMENT-NUMBER: 20030033347 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030033347 A1 

TITLE: Method and apparatus for inducing classifiers for multimedia based on unified 
representation of features reflecting disparate modalities 

PUBLICATION-DATE: February 13, 2003 

INVENTOR-INFORMATION: 

NAME               CITY           STATE  COUNTRY  RULE-47 

Bolle, Rudolf M.   Bedford Hills  NY     US 

Haas, Norman       Mount Kisco    NY     US 

Oles, Frank J.     Peekskill      NY     US 

Zhang, Tong        Tuckahoe       NY     US 
□ 23. Document ID: US 20030033288 A1 

L13: Entry 23 of 70 File: PGPB Feb 13, 2003 

PGPUB-DOCUMENT-NUMBER: 20030033288 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030033288 A1 

TITLE: Document-centric system with auto-completion and auto-correction 
PUBLICATION-DATE: February 13, 2003 

INVENTOR-INFORMATION: 

NAME                       CITY        STATE  COUNTRY  RULE-47 

Shanahan, James G.         Pittsburgh  PA     US 

Grefenstette, Gregory T.   Gieres             FR 

US-CL-CURRENT: 707/3 
L13: Entry 24 of 70 File: PGPB Feb 13, 2003 

PGPUB-DOCUMENT-NUMBER: 20030033287 
PGPUB-FILING-TYPE: new 

DOCUMENT-IDENTIFIER: US 20030033287 A1 

TITLE: Meta-document management system with user definable personalities 
PUBLICATION-DATE: February 13, 2003 
INVENTOR-INFORMATION: 
Shanahan, James G. 
Grefenstette, Gregory T. 
Fernstrom, Christer 
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